Two problems and solutions in evolutionary corpus-based language dynamics research
The advection effect and frequency change bias in semantic change



Andres Karjus, Richard A. Blythe, Simon Kirby, Kenny Smith



University of Edinburgh




New Directions in Language Evolution Research @ SLE Tallinn 2018

Modelling interaction in the lexicon

1. Frequency change bias in semantic change measures

Testing approach

Interim conclusions (part 1)

2. Fluctuations in topic frequencies

The topical-cultural advection model

How does this work?

How well does it work?

Conclusions (part 2)

Future work









Appendix

Math and parameters (the frequency change - semantic change simulation)

For each target word \(w\) with an original frequency \(f\), and each value of \(s\), \(w\) was downsampled by randomly relabeling a fixed portion of its occurrences as \(w'\) in the corpus, where the size of the portion is \(e^{\ln(f) - s} = f/e^s\) occurrences (downsamples with \(<10\) occurrences were excluded).
E.g., if \(f=1000\) and \(s=0.7\), then \(e^{\ln(1000) - 0.7} \approx 496.6\), a reduction of about 50.3%.
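The downsampling step above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used in the study: the function names `downsample_size` and `relabel`, the token-list representation of the corpus, and the rounding of \(f/e^s\) to the nearest integer are all hypothetical choices.

```python
import math
import random


def downsample_size(f, s):
    """Size of the relabeled portion, e^(ln(f) - s) = f / e^s."""
    return math.exp(math.log(f) - s)


def relabel(tokens, word, s, rng, min_count=10):
    """Relabel a random portion of the occurrences of `word` as word + "'".

    Returns the relabeled token list, or None if the portion would have
    fewer than `min_count` occurrences (such downsamples are excluded).
    Rounding to the nearest integer is an assumption of this sketch.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    k = round(downsample_size(len(positions), s))
    if k < min_count:
        return None
    chosen = set(rng.sample(positions, k))
    return [tok + "'" if i in chosen else tok for i, tok in enumerate(tokens)]


# Toy corpus: 100 occurrences of "a" and 50 of "b" (made-up data).
toy_corpus = ["a"] * 100 + ["b"] * 50
relabeled = relabel(toy_corpus, "a", 0.7, random.Random(0))
```

With \(s=0\) the whole set of occurrences is relabeled, which corresponds to the 0-downsample sanity check.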

On the heatmaps, \(w\) being the closest synonym to \(w'\) is marked as False if this occurs in any of the 10 replicates. The downsampling procedure included a 0-downsample as a sanity check, which entailed no reduction in frequency, i.e. 100% of the occurrences were sampled; these are also displayed on the plots as the first column/value.

Semantic vector space performance measure

Math (the topical-cultural advection model)

The advection value of a word in time period \(t\) is defined as the weighted mean of the changes in frequencies (compared to the previous period) of the words associated with it. More precisely, the topical advection value for a word \(\omega\) at time period \(t\) is

\[\begin{equation} {\rm advection}(\omega;t) := {\rm weightedMean}\big( \{ {\rm logChange}(N_i;t) \mid i=1,\ldots,m \}, \, W \big) \end{equation}\]

where \(N\) is the set of \(m\) words associated with the target at time \(t\) and \(W\) is the set of weights (to be defined below) corresponding to those words. The weighted mean is simply

\[\begin{equation} {\rm weightedMean}(X, W) := \frac{\sum x_i w_i }{\sum w_i} \end{equation}\]

where \(x_i\) and \(w_i\) are the \(i^{\rm th}\) elements of the sets \(X\) and \(W\) respectively. The log change for period \(t\) for each of the associated words \(\omega'\) is given by the change in the logarithm of its frequencies from the previous to the current period. That is,

\[\begin{equation} {\rm logChange}(\omega';t) := \log[f(\omega';t)+1] - \log[f(\omega';t-1)+1] \end{equation}\]

where \(f(\omega';t)\) is the number of occurrences of word \(\omega'\) in the time period \(t\). Note we add \(1\) to these frequency counts, to avoid \(\log(0)\) appearing in the expression.
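The three definitions above translate directly into a few lines of Python. This is a minimal sketch: the function names, the dictionary-based frequency tables, and the example weights and counts are hypothetical, and how the associated words and their weights are obtained is not covered here.

```python
import math


def log_change(f_now, f_prev):
    """logChange: log[f(t) + 1] - log[f(t-1) + 1]; the +1 avoids log(0)."""
    return math.log(f_now + 1) - math.log(f_prev + 1)


def weighted_mean(xs, ws):
    """weightedMean(X, W) = sum(x_i * w_i) / sum(w_i)."""
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)


def advection(associated_words, weights, freq_now, freq_prev):
    """Weighted mean of the log frequency changes of the associated words.

    `freq_now` and `freq_prev` map words to occurrence counts in periods
    t and t-1; words missing from a table are counted as 0 occurrences.
    """
    changes = [
        log_change(freq_now.get(w, 0), freq_prev.get(w, 0))
        for w in associated_words
    ]
    return weighted_mean(changes, weights)


# Made-up example: two associated words with weights 1 and 3.
freq_prev = {"a": 9, "b": 99}
freq_now = {"a": 19, "b": 99}
value = advection(["a", "b"], [1, 3], freq_now, freq_prev)
```

In this toy case the first word's smoothed count doubles (log change \(\log 2\)) and the second is unchanged, so the advection value is \(\log(2)/4\).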

Parameters




*This research was supported by the scholarship program Kristjan Jaak, funded and managed by the Archimedes Foundation in collaboration with the Ministry of Education and Research of Estonia.